{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NLP Course 2 Week 1 Lesson : Building The Model - Lecture Exercise 01\n",
    "Estimated Time: 10 minutes\n",
    "<br>\n",
    "# Vocabulary Creation \n",
    "Create a tiny vocabulary from a tiny corpus\n",
    "<br>\n",
    "It's time to start small !\n",
    "<br>\n",
    "### Imports and Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "import re # regular expression library; for tokenization of words\n",
    "from collections import Counter # collections library; counter: dict subclass for counting hashable objects\n",
    "import matplotlib.pyplot as plt # for data visualization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "red pink pink blue blue yellow ORANGE BLUE BLUE PINK\n",
      "string length :  52\n"
     ]
    }
   ],
   "source": [
    "# the tiny corpus of text ! \n",
    "text = 'red pink pink blue blue yellow ORANGE BLUE BLUE PINK' # 🌈\n",
    "print(text)\n",
    "print('string length : ',len(text))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "red pink pink blue blue yellow orange blue blue pink\n",
      "string length :  52\n"
     ]
    }
   ],
   "source": [
    "# convert all letters to lower case\n",
    "text_lowercase = text.lower()\n",
    "print(text_lowercase)\n",
    "print('string length : ',len(text_lowercase))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['red', 'pink', 'pink', 'blue', 'blue', 'yellow', 'orange', 'blue', 'blue', 'pink']\n",
      "count :  10\n"
     ]
    }
   ],
   "source": [
    "# some regex to tokenize the string to words and return them in a list\n",
    "words = re.findall(r'\\w+', text_lowercase)\n",
    "print(words)\n",
    "print('count : ',len(words))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Vocabulary\n",
    "Option 1 : A set of distinct words from the text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'orange', 'blue', 'red', 'yellow', 'pink'}\n",
      "count :  5\n"
     ]
    }
   ],
   "source": [
    "# create vocab\n",
    "vocab = set(words)\n",
    "print(vocab)\n",
    "print('count : ',len(vocab))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add Information with Word Counts\n",
    "Option 2 : Two alternatives for including the word count as well"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'red': 1, 'pink': 3, 'blue': 4, 'yellow': 1, 'orange': 1}\n",
      "count :  5\n"
     ]
    }
   ],
   "source": [
    "# create vocab including word count\n",
    "counts_a = dict()\n",
    "for w in words:\n",
    "    counts_a[w] = counts_a.get(w,0)+1\n",
    "print(counts_a)\n",
    "print('count : ',len(counts_a))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Counter({'blue': 4, 'pink': 3, 'red': 1, 'yellow': 1, 'orange': 1})\n",
      "count :  5\n"
     ]
    }
   ],
   "source": [
    "# create vocab inlcuding word count using collections.Counter\n",
    "counts_b = dict()\n",
    "counts_b = Counter(words)\n",
    "print(counts_b)\n",
    "print('count : ',len(counts_b))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD8CAYAAACMwORRAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAEqZJREFUeJzt3X2sXHd95/H3B8c0bAGlyLdNZPvGaOtCgYUkXEyi0G3KAkrStNnuZrvJtqTN7tYKDQLUJ9GHDYqqqtX+0d2GQFwvpElEgaXlQVZwCtFCNgnCIbZJHBIH1aJEsWIRE6iDSQp1+u0fc7zMTsaec++d62v//H5JR/c8/ObM98zM/cyZ35w5J1WFJKktz1vuAiRJ02e4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhp0ynLd8apVq2rdunXLdfeSdELasWPHN6tqZlK7ZQv3devWsX379uW6e0k6ISV5tE87u2UkqUGGuyQ1yHCXpAYZ7pLUIMNdkhrUO9yTrEjy5SS3jVmWJNcn2ZNkV5JzplumJGk+5rPn/k5g9xGWXQSs74aNwI2LrEuStAi9wj3JGuBngQ8cocmlwK01sA04LckZU6pRkjRPfffc/yfwO8A/HWH5auCxoem93TxJ0jKY+AvVJJcAT1TVjiQXHKnZmHnPufJ2ko0Mum2YnZ2dR5mj61nwTY87Xp9c0lLos+d+PvDzSb4OfBR4Y5IPjbTZC6wdml4DPD66oqraXFVzVTU3MzPx1AiSpAWaGO5V9btVtaaq1gGXA5+rql8eabYFuLI7auZc4EBV7Zt+uZKkPhZ84rAkVwNU1SZgK3AxsAd4GrhqKtVJkhZkXuFeVXcCd3bjm4bmF3DNNAuTJC2cv1CVpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBk0M9ySnJvlSkgeSPJTkujFtLkhyIMn93XDt0pQrSeqjz2X2vge8saoOJlkJ3JPk9qraNtLu7qq6ZPolSpLma2K4d9dHPdhNruyGWsqiJEmL06vPPcmKJPcDTwB3VNW9Y5qd13Xd3J7klVOtUpI0L73CvaqeraqzgDXAhiSvGmmyEzizql4DvBf41Lj1JNmYZHuS7fv3719M3ZKko5jX0TJV9ffAncCFI/OfqqqD3fhWYGWSVWNuv7mq5qpqbmZmZuFVS5KOqs/RMjNJTuvGXwC8CXhkpM3pSdKNb+jW++T0y5Uk9dHnaJkzgFuSrGAQ2h+rqtuSXA1QVZuAy4C3JTkEPANc3n0RK0laBn2OltkFnD1m/qah8RuAG6ZbmiRpofyFqiQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDWozzVUT03ypSQPJHkoyXVj2iTJ9Un2JNmV5JylKVeS1Eefa6h+D3hjVR1MshK4J8ntVbVtqM1FwPpueD1wY/dXkrQMJu6518DBbnJlN4xe/PpS4Nau7TbgtCRnTLdUSVJfffbcSbIC2AH8OPC+qrp3pMlq4LGh6b3dvH0j69kIbASYnZ1dYMni/25f7gqm56fnlrsCqUm9vlCtqmer6ixgDbAhyatGmmTczcasZ3NVzVXV3MzMzPyrlST1Mq+jZarq74E7gQtHFu0F1g5NrwEeX1RlkqQF63O0zEyS07rxFwBvAh4ZabYFuLI7auZc4EBV7UOStCz69LmfAdzS9bs/D/hYVd2W5GqAqtoEbAUuBvYATwNXLVG9kqQeJoZ7Ve0Czh4zf9PQeAHXTLc0SdJC+QtVSWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJalCfa6iuTfL5JLuTPJTknWPaXJDkQJL7u+HapSlXktRHn2uoHgJ+s6p2JnkRsCPJHVX18Ei7u6vqkumXKEmar4l77lW1r6p2duPfAXYDq5e6MEnSws2rzz3JOgYXy753zOLzkjyQ5PYkrzzC7Tcm2Z5k+/79++ddrCSpn97hnuSFwMeBd1XVUyOLdwJnVtVrgPcCnxq3jqraXFVzVTU3MzOz0JolSRP0CvckKxkE+19W1SdGl1fVU1V1sBvfCqxMsmqqlUqSeutztEyADwK7q+pPj9Dm9K4dSTZ0631ymoVKkvrrc7TM+cBbgQeT3N/N+z1gFqCqNgGXAW9Lcgh4Bri8qmoJ6pUk9TAx3KvqHiAT2twA3DCtoiRJi+MvVCWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBfa6hujbJ55PsTvJQkneOaZMk1yfZk2RXknOWplxJUh99rqF6CPjNqtqZ5EXAjiR3VNXDQ20uAtZ3w+uBG7u/kqRlMHHPvar2VdXObvw7wG5g9UizS4Fba2AbcFqSM6ZerSSpl3n1uSdZB5wN3DuyaDXw2ND0Xp77BkCSjUm2J9m+f//++VUqSeqtd7gneSHwceBdVfXU6OIxN6nnzKjaXFVzVTU3MzMzv0olSb31CvckKxkE+19W1SfGNNkLrB2aXgM8vvjyJEkL0edomQAfBHZX1Z8eodkW4MruqJlzgQNVtW+KdUqS5qHP0TLnA28FHkxyfzfv94BZgKraBGwFLgb2AE8DV02/VElSXxPDvaruYXyf+nCbAq6ZVlGSpMXxF6qS1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAb1uczeTUmeSPKVIyy/IMmBJPd3w7XTL1OSNB99LrN3M3ADcOtR2txdVZdMpSJJ0qJN3HOvqruAbx2DWiRJUzKtPvfzkjyQ5PYkr5zSOiVJC9SnW2aSncCZVXUwycXAp4D14xom2QhsBJidnZ3CXUuSxln0nntVPVVVB7vxrcDKJKuO0HZzVc1V1dzMzMxi71qSdASLDvckpydJN76hW+eTi12vJGnhJnbLJPkIcAGwKsle4D3ASoCq2gRcBrwtySHgGeDyqqolq1iSNNHEcK+qKyYsv4HBoZKSpOOEv1CVpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBk0M9yQ3JXkiyVeOsDxJrk+yJ8muJOdMv0xJ0nz02XO/GbjwKMsvAtZ3w0bgxsWXJUlajInhXlV3Ad86SpNLgVtrYBtwWpIzplWgJGn+ptHnvhp4bGh6bzdPkrRMTpnCOjJmXo1tmGxk0HXD7OzsFO5aJ52Me7mdoGrsv8kErWz/Arb9w61sO/CfFvLcz8809tz3AmuHptcAj49rWFWbq2ququZmZmamcNeSpHGmEe5bgCu7o2bOBQ5U1b4prFeStEATu2WSfAS4AFiVZC/wHmAlQFVtArYCFwN7gKeBq5aqWElSPxPDvaqumLC8gGumVpEkadH8haokNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1qFe4J7kwyVeT7Eny7jHLL0hyIMn93XDt9EuVJPXV5xqqK4D3AW8G9gL3JdlSVQ+PNL27qi5ZgholSfPUZ899A7Cnqr5WVd8HPgpcurRlSZIWo0+4rwYeG5re280bdV6SB5LcnuSVU6lOkrQgE7tlgIyZVyPTO4Ezq+pgkouBTwHrn7OiZCOwEWB2dnaepUqS+uqz574XWDs0vQZ4fLhBVT1VVQe78a3AyiSrRldUVZuraq6q5mZmZhZRtiTpaPqE+33A+iQvTfJ84HJgy3CDJKcnSTe+oVvvk9MuVpLUz8Rumao6lOTtwGeAFcBNVfVQkqu75ZuAy4C3JTkEPANcXlWjXTeSpGOkT5/74a6WrSPzNg2N3wDcMN3SJEkL5S9UJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUG9wj3JhUm+mmRPknePWZ4k13fLdyU5Z/qlSpL6mhjuSVYA7wMuAl4BXJHkFSPNLgLWd8NG4MYp1ylJmoc+e+4bgD1V9bWq+j7wUeDSkTaXArfWwDbgtCRnTLlWSVJPfcJ9NfDY0PTebt5820iSjpFTerTJmHm1gDYk2cig2wbgYJKv9rj/5bQK+OZS3kHGPXLHhyXf9uPc0m//8fvkH4Pn/mTeduCXFrX9Z/Zp1Cfc9wJrh6bXAI8voA1VtRnY3Kew40GS7VU1t9x1LIeTedvh5N5+t72Nbe/TLXMfsD7JS5M8H7gc2DLSZgtwZXfUzLnAgaraN+VaJUk9Tdxzr6pDSd4OfAZYAdxUVQ8lubpbvgnYClwM7AGeBq5aupIlSZP06ZahqrYyCPDheZuGxgu4ZrqlHRdOmC6kJXAybzuc3Nvvtjcgg1yWJLXE0w9IUoNOynBPsi7JV8bMvzNJE9+Uz1eSD4z55fFom5uTXHasajreJLkgyW3LXce0JTnY/R37f6ETU68+d7Wvqv7rctewXJKEQRflPy13LVo6J9vzfFLuuXdOSXJLd6Kzv07yL4YXHt6b6cYvS3JzNz6T5ONJ7uuG849x3YvS7Z09Mrrtw59akhxM8kdJHkiyLcmPjVnPH3Z78ifka6h7HHYneT+wE3hrki8m2Znkr5K8sGt3Yfd43QP8u2UtuqfuuXnn0PQfJXlHkt/uXrO7klw3YR2nJvmLJA8m+XKSn+nmb03y6m78y0muHbrPZd9BSPIbSb7SDe8a8zyvTXJjku1JHhp+HJJ8Pcl13WvgwSQv7+bPJLmjm//nSR5Nsqpb9stJvpTk/m7ZiuXZ8uc6If8xp+RlwOaqejXwFPDrPW/3Z8D/qKrXAf8e+MAS1beUJm37DwPbquo1wF3Arw0vTPLfgR8FrjrB94JeBtwKvBn4L8CbquocYDvwG0lOBf4X8HPATwGnL1eh8/RB4FcAujffy4FvMDix3wbgLOC1Sf71UdZxDUBV/SvgCuCW7vG4C/ipJC8GDgGHd27eANw9/U3pL8lrGRyG/XrgXAav2x+he56r6uyqehT4/e6HSq8Gfvrwm1Xnm91r4Ebgt7p57wE+183/JDDb3d9PAv8ROL+qzgKeBX5piTezt5O5W+axqvpCN/4h4B09b/cm4BX5wU/HX5zkRVX1nWkXuIQmbfv3gcN9yzsYhN9h/w24t6o2cuJ7tKq2JbmEwRlPv9A9r88Hvgi8HPi7qvpbgCQf4genzzhuVdXXkzyZ5Gzgx4AvA68D3tKNA7yQQdjfdYTVvAF4b7e+R5I8CvwEgwB/B/B3wKeBN3efetdV1XKfTuQNwCer6rsAST7B4E350e6Ehof9YganQjkFOIPBc7+rW/aJ7u8OfvBJ7Q3ALwBU1d8k+XY3/98ArwXu6143LwCeWILtWpCTOdxHjwE92vSpQ+PPA86rqmeWpKpjY9K2/2P94BjZZ/n/Xyf3Mdjre0lVfWupCjxGvtv9DXBHVV0xvDDJWYw5R9IJ4gPArzL4tHETgyD646r68563P9LJT+4D5oCvAXcwOBfLrzEIw+V2pJq/+/8aJC9lsEf+uqr6dtfdOvz//b3u7/Dr/kjrDXBLVf3ugiteQidzt8xskvO68SuAe0aWfyPJT3Yfa39haP5ngbcfnugC4EQzaduP5m+APwE+neRFU69seWwDzk/y4wDddxA/ATwCvDTJv+zaXXGkFRyHPglcyGCP/TPd8J+HvktYneRHj3L7u+i6GLrHYhb4anfa78eAX2TwuN3NICyXtUumcxfwb7vn74cZ/N+O1vViBmF/oPsu6aIe672HwfaS5C0MunoA/g9w2eHHMclLkvQ6qdexcDKH+27gV5LsAl7Ccy8w8m4GXROfA4bPk/MOYK77Uuph4OpjUeyUTdr2o6qqv2LQF70lyQuWoL5jqqr2M9jL/Uj3mGwDXl5V/8CgG+bT3Reqjy5flfPThfDngY9V1bNV9Vngw8AXkzwI/DVwtDfn9wMrurb/G/jVqjq8V3s38I2qerobX8NxEO5VtRO4GfgScC+DTy/fHmnzAIOuqYcYfKL5ApNdB7wlyU4Gbwb7gO9U1cPAHwCf7V43dzDo5jku+AvVk0ySdcBtVfWqZS5FS6j7xLkT+A+HvzPQwiT5IeDZ7jxb5wE3dl+gHtdO5j53qUkZ/BjtNgZfLhrsizcLfKx7w/w+I0ePHa/cc5ekBp3Mfe6S1CzDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQf8MjJQFBhqpBowAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# barchart of sorted word counts\n",
    "d = {'blue': counts_b['blue'], 'pink': counts_b['pink'], 'red': counts_b['red'], 'yellow': counts_b['yellow'], 'orange': counts_b['orange']}\n",
    "plt.bar(range(len(d)), list(d.values()), align='center', color=d.keys())\n",
    "_ = plt.xticks(range(len(d)), list(d.keys()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ungraded Exercise\n",
    "Note that `counts_b`, above, returned by `collections.Counter` is sorted by word count\n",
    "\n",
    "Can you modify the tiny corpus of ***text*** so that a new color appears \n",
    "between ***pink*** and ***red*** in `counts_b` ?\n",
    "\n",
    "Do you need to run all the cells again, or just specific ones ? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "counts_b :  Counter({'blue': 4, 'pink': 3, 'red': 1, 'yellow': 1, 'orange': 1})\n",
      "count :  5\n"
     ]
    }
   ],
   "source": [
    "print('counts_b : ', counts_b)\n",
    "print('count : ', len(counts_a))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Expected Outcome:\n",
    "\n",
    "counts_b : Counter({'blue': 4, 'pink': 3, **'your_new_color_here': 2**, red': 1, 'yellow': 1, 'orange': 1})\n",
    "<br>\n",
    "count :  6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary\n",
    "\n",
    "This is a tiny example but the methodology scales very well.\n",
    "<br>\n",
    "In the assignment you will create a large vocabulary of thousands of words, from a corpus\n",
    "<br>\n",
    "of tens of thousands or words! But the mechanics are exactly the same. \n",
    "<br> \n",
    "The only extra things to pay attention to will be; run time, memory management and the vocab data structure.\n",
    "<br> \n",
    "So the choice of approach used in code blocks `counts_a` vs `counts_b`, above, will be important."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
